1. INTRODUCTION

Hate speech is any communicative act used to express hatred towards a person or a group on the
basis of a characteristic such as race, ethnicity, gender, sexual orientation, nationality, religion, or
another characteristic [1]. Due to the massive increase in user-generated web content, in particular on
social media networks where anyone can post a statement freely without limitation, the amount of hateful
activity is also increasing. Social media technology enables people to express their opinions, including
hate speech, quickly; such content can then spread widely and become viral if the topics covered are
‘interesting’. This can provoke disputes between groups in society. In Indonesia, according to data from the
National Police Criminal Investigation Agency of Indonesia, there were 143 cybercrimes in the form of hate
speech in 2015. This number increased to 199 in 2016. However, these data cover only hate speech that was
criminalized and reported to the police. Much more hate speech presumably exists across the various social
media platforms.

Twitter is one of the popular social media platforms in Indonesia [2]. Social media and microblogging
web services such as Twitter allow user tweets to be read and analyzed in near real time. Twitter is a
logical source of data for hate speech analysis since Twitter users are more likely to express their
emotions about an event by posting tweets [3]. This analysis can help identify hate speech early so that it
can be prevented from spreading widely. It is also useful for content filtering and early detection of
wrongful activities [4]. Detecting hateful tweets manually is costly and not scalable. Therefore, an
automatic method of hate speech detection needs to be developed for tweets in the Indonesian language.

Previous work on hate speech detection has mostly targeted English [5-7]. Most of these studies used
machine learning techniques on datasets collected from Twitter. Meanwhile, studies of hate speech detection
in the Indonesian language are still very rare. As far as we know, [8] and [9] are the only works on hate
speech detection in the Indonesian language. These works provide Twitter datasets for Indonesian hate speech
detection and also use a machine learning approach to tackle the problem. We likewise treat hate speech
detection as a text classification problem. In this work, we focus on the problem of classifying a tweet as
hate speech or not. Text classification techniques mostly use bag of words features and machine learning
methods such as Naive Bayes (NB) [10], K-Nearest Neighbors (KNN) [11], Maximum Entropy (ME) [12], Random
Forest (RF) [13], or Support Vector Machines (SVM) [1] for the classification task.

In this work, we used an ensemble method to tackle this problem. An ensemble of classifiers is a set
of stand-alone classifiers that are combined to classify a new tweet in order to improve classification
performance [14]. In general text classification, several works using ensemble methods have been conducted
and have reported that ensemble methods can enhance classification performance (e.g. [15-17]). The
classifiers used in this ensemble are NB, KNN, ME, RF, and SVM. We aim to improve on the performance of the
stand-alone classifiers by combining them.


2. RESEARCH METHOD 
As seen in Figure 1, hate speech detection in this work consists of three main stages: 1) 
preprocessing; 2) training some stand-alone classifiers; and 3) combining the classifiers. 
Figure 1. Hate Speech Detection Flowchart








2.1. Tweet Preprocessing 

Tweet preprocessing consists of four steps: 1) tokenization; 2) filtering; 3) stemming; and 4) term
weighting. Tokenization is the task of splitting tweets into smaller units called tokens or terms. In this
process, case folding and cleansing are also conducted. Case folding is the process of converting all
characters into lowercase. In the cleansing process, punctuation, numbers, HTML tags, and characters
outside of the alphabet are removed. The next step is filtering, or stopword removal. Stopwords, or
uninformative words, are removed in this step based on an existing stoplist dictionary. In this work, we
used the stoplist dictionary by Tala [18]. The third step is stemming, the process of reducing every word to
its root or base form. For example, the words ‘dilawan’, ‘melawan’, and ‘perlawanan’ are all converted to
the same word ‘lawan’ [19]-[21].
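
The tokenization, filtering, and stemming steps can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the implementation used in the experiments: the names
STOPWORDS, STEM_LOOKUP, tokenize, and preprocess are hypothetical, the stoplist is a tiny stand-in for the
Tala stoplist [18], and the lookup table covers only the three example words in place of a real Indonesian
stemmer.

```python
import re

# Hypothetical mini stoplist; a real run would load the full Tala stoplist [18].
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "itu", "ini"}

# Placeholder for a real Indonesian stemmer: covers only the example words.
STEM_LOOKUP = {"dilawan": "lawan", "melawan": "lawan", "perlawanan": "lawan"}


def tokenize(tweet):
    """Case-fold, cleanse, and split a tweet into tokens."""
    text = tweet.lower()                   # case folding
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation, digits, other symbols
    return text.split()


def preprocess(tweet):
    """Tokenize, remove stopwords, then map each token to its stem."""
    tokens = [t for t in tokenize(tweet) if t not in STOPWORDS]
    return [STEM_LOOKUP.get(t, t) for t in tokens]


print(preprocess("Mereka MELAWAN, dan <b>perlawanan</b> itu dilawan 100%!"))
```

Tokens that are missing from the lookup table are passed through unchanged, which is why a real stemmer
would be needed for anything beyond this example.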

The last step in preprocessing is word or term weighting. In this work, we use bag of words (BOW) 
features with TF.IDF weighting. TF.IDF is the most popular term weighting method in text classification 
[22]. TF.IDF is a combination of term frequency (TF) and inverse document frequency (IDF). The TF.IDF 
weight of term or word t in tweet or document d is calculated as follows: 


$$\text{TF-IDF}(t,d) = \left(1 + \log f_{t,d}\right) \cdot \log\frac{N}{df_t}$$

where $f_{t,d}$ is the number of occurrences of term t in tweet d, $N$ is the number of tweets in the corpus,
and $df_t$ is the number of tweets in the corpus that contain term t. Finally, this stage produces the bag of
words (BOW) features that are used in the next stage.
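
As a worked illustration of the term weighting step, the sketch below computes a weight of the form
(1 + log f) * log(N / df) for every term of a small tokenized corpus. The function name tf_idf and the toy
corpus are hypothetical, and the use of the natural logarithm is an assumption, since the base of the
logarithm is not stated.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # raw frequency of each term in this document
        weights.append({
            term: (1 + math.log(f)) * math.log(n_docs / df[term])
            for term, f in counts.items()
        })
    return weights


corpus = [["lawan", "benci", "lawan"], ["benci", "damai"]]
for vector in tf_idf(corpus):
    print(vector)
```

Under this weighting, a term that occurs in every tweet (here 'benci') receives a weight of zero. Library
implementations such as Scikit-Learn's TfidfVectorizer apply a smoothed IDF and vector normalization by
default, so their values would differ from this sketch.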


2.2. Training Some Stand-alone Classifiers 

In the second stage, several popular classifiers are trained. In this work, we used Naive Bayes,
K-Nearest Neighbours, Maximum Entropy, Random Forest, and Support Vector Machines. For Naive Bayes, we used
the multinomial distribution, as it has been shown to perform well in text classification. Likewise, for
SVM, we used a linear kernel for the same reason. Finally, the classifiers are combined into an ensemble in
the last stage.
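
To make the multinomial Naive Bayes choice concrete, the following is a minimal sketch of such a classifier
over token lists. It is not the Scikit-Learn implementation used in the experiments: the class name
TinyMultinomialNB and the toy tweets are hypothetical, add-one smoothing is an assumption, and raw token
counts are used here instead of TF.IDF weights to keep the example short.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {t for doc in docs for t in doc}
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        class_sizes = Counter(labels)
        self.log_prior = {
            c: math.log(class_sizes[c] / len(labels)) for c in self.classes
        }
        return self

    def log_scores(self, doc):
        """Unnormalized log posterior of each class for one tokenized document."""
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.classes:
            total = sum(self.token_counts[c].values())
            score = self.log_prior[c]
            for t in doc:
                if t in self.vocab:  # tokens unseen in training are ignored
                    score += math.log((self.token_counts[c][t] + 1) / (total + vocab_size))
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


train_docs = [["benci", "kamu"], ["benci", "mereka"], ["selamat", "pagi"], ["terima", "kasih"]]
train_labels = ["hate", "hate", "non_hate", "non_hate"]
model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["benci"]))
```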


2.3. Combining the Classifiers 

In the last stage, the classifiers from the previous stage are combined. We evaluated two types of
ensemble methods: 1) hard voting; and 2) soft voting. In hard voting, each stand-alone classifier has one
vote. As seen in Figure 2, the category of a tweet is selected by majority voting: the selected category is
the one that receives more than half of the votes. In soft voting, the average category probabilities are
used as the voting scores. As seen in Figure 3, the final category of a tweet is the one with the highest
voting score, that is, the highest probability averaged over all classifiers.

Figure 2. Hard Voting Ensemble Method
Figure 3. Soft Voting Ensemble Method
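
The two voting schemes can be sketched in a few lines of Python. The function names hard_vote and soft_vote
and the probability values are hypothetical and chosen only for illustration; with two categories and an odd
number of classifiers a tie cannot occur, so ties are not handled specially here.

```python
from collections import Counter


def hard_vote(predictions):
    """predictions: one predicted label per classifier."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """probabilities: one {label: probability} dict per classifier."""
    labels = probabilities[0].keys()
    averages = {
        label: sum(p[label] for p in probabilities) / len(probabilities)
        for label in labels
    }
    return max(averages, key=averages.get)


# Hypothetical outputs of three classifiers for a single tweet.
probs = [
    {"hate": 0.90, "non_hate": 0.10},
    {"hate": 0.45, "non_hate": 0.55},
    {"hate": 0.45, "non_hate": 0.55},
]
votes = [max(p, key=p.get) for p in probs]
print(hard_vote(votes))  # two of the three classifiers vote non_hate
print(soft_vote(probs))  # the averaged probability favours hate
```

In this example the two schemes disagree: one confident classifier outweighs two marginal ones under soft
voting, while hard voting counts every classifier equally. Scikit-Learn offers both schemes through its
VotingClassifier.
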

3. RESULTS AND ANALYSIS

We implemented the experiments using Scikit-Learn [23]. We used the Indonesian-language Twitter hate
speech dataset collected and labelled by [9]. It contains 260 tweets labelled as hate speech and 445 tweets
labelled as non hate speech. We kept the dataset unbalanced in the first experiment. For the second
experiment, we transformed the unbalanced dataset into a balanced one using an undersampling method: we
selected non hate speech tweets at random so that their number equals the number of hate speech tweets.
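
Random undersampling of this kind can be sketched as follows. The function name undersample, the seed, and
the integer stand-ins for tweets are hypothetical; only the class sizes (260 and 445) come from the dataset
description.

```python
import random
from collections import Counter


def undersample(items, labels, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for item, label in zip(items, labels):
        by_label.setdefault(label, []).append(item)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        # Sampling without replacement; the smallest class is kept in full.
        balanced.extend((item, label) for item in rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


tweets = list(range(705))  # integer stand-ins for the tweets
labels = ["hate"] * 260 + ["non_hate"] * 445
balanced = undersample(tweets, labels)
print(Counter(label for _, label in balanced))
```

Fixing the seed makes the selected subset reproducible; a different seed would keep a different random
subset of the non hate speech tweets.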

In the experiments, we compared the results of the stand-alone classifiers with those of our ensemble
methods. We used 10-fold cross validation, which means that the dataset is first divided equally into 10
folds. In each iteration of cross validation, tweets from 9 folds were used as training data and the
remaining fold was used as testing data. We used the average F1 measure as the evaluation metric in these
experiments. The experiment results are displayed in Figure 4 and Figure 5.
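
The evaluation protocol can be sketched as below. The helper names k_fold_indices and f1_measure are
hypothetical, the sketch assumes the tweets have already been shuffled, and it computes the F1 measure for a
single positive category; how the reported average is taken (over folds, over categories, or both) is not
specified, so averaging the per-fold values is an assumption.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Fold i receives samples i, i + k, i + 2k, ..., so fold sizes differ by at most one.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def f1_measure(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


for train, test in k_fold_indices(705, k=10):
    print(len(train), len(test))
```

In a full run, a classifier would be trained on the train indices of each fold, scored with f1_measure on the
test indices, and the ten scores would then be averaged.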


Figure 4. Hate Speech Detection Performance using Unbalanced Dataset
Figure 5. Hate Speech Detection Performance using Balanced Dataset


As seen in Figure 4, NB has the best performance among all stand-alone classifiers on the unbalanced
dataset, with an F1 measure of 78.3%. SVM performed almost the same as NB, with an F1 measure of 78.1%. KNN
was clearly the weakest classifier, with an F1 measure of only 68.2%. Meanwhile, RF and ME performed better
than KNN, with F1 measures of 71.2% and 74.3% respectively.

Almost all of the ensemble methods have a higher F1 measure than the stand-alone classifiers on the
unbalanced dataset. However, hard voting with 5 classifiers (NB, KNN, ME, RF, SVM), whose F1 measure is
77.9%, cannot exceed the performance of NB. The decision in hard voting is determined equally by all of the
stand-alone classifiers, so the F1 measure of the hard voting method usually falls between that of the best
classifier and that of the worst classifier. It is hard for hard voting to achieve a higher F1 measure than
the best classifier because the gap between the best classifier (78.3%) and the worst classifier (68.2%) in
the combination is too large. This does not happen with soft voting. Soft voting with 5 classifiers still
surpasses all stand-alone classifiers, with an F1 measure of 78.9%. Although it also combines all of the
classifiers, soft voting scores each category by its average probability over all of the classifiers. A
category that wins under hard voting may therefore lose under soft voting because its average probability
is lower than that of another category. Soft voting thus provides a more robust voting scheme, as it often
reduces overfitting and produces a smoother model.

The ensembles that use only the three best classifiers (NB, SVM, and RF) have the best performance
with both hard voting and soft voting. Under this scheme, hard voting and soft voting have the same F1
measure, 79.8%. Since an ensemble method is affected by the classifiers that compose it, using only the
best classifiers increases the chance that the ensemble achieves better performance.

Meanwhile, the results of the second experiment, which uses the balanced dataset, can be seen in
Figure 5. As predicted, all of the classification methods achieved a higher F1 measure on the balanced
dataset. KNN is still the worst classifier, with an F1 measure of 76.8%. RF performed slightly better, with
an F1 measure of 77.6%. Surprisingly, ME has the best performance, with an F1 measure of 84.1%. NB and SVM
are below ME by only a slight margin. All of the ensemble methods have almost the same F1 measure and also
perform better than almost all of the stand-alone classifiers. These two experiments showed that the
ensemble method can improve performance, even if not significantly. Nevertheless, the ensemble method
reduces the risk of choosing a weak classifier for detecting new tweets.


4. CONCLUSION 

In this study, we used ensemble methods for hate speech detection in the Indonesian language. We
employed five stand-alone classification algorithms (Naive Bayes, K-Nearest Neighbours, Maximum Entropy,
Random Forest, and Support Vector Machines) and two ensemble methods on a Twitter hate speech dataset.

On the unbalanced dataset, the experiment results show that Naive Bayes offered the best performance
among the five stand-alone classifiers, with an F1 measure of 78.3%. The experimental results also show
that the ensemble technique can improve classification performance. The best result is achieved with an
ensemble of the three best classifiers (Naive Bayes, Support Vector Machine, and Random Forest), with an F1
measure of 79.8%.

Meanwhile, as predicted, all of the classification methods achieved a higher F1 measure on the
balanced dataset. Surprisingly, Maximum Entropy has the best performance in this second experiment, with an
F1 measure of 84.1%. On the balanced dataset, all of the ensemble methods have almost the same F1 measure
and also perform better than almost all of the stand-alone classifiers. These two experiments showed that
using an ensemble method can improve the performance of the system. Although the improvement is not
significant, using an ensemble method can reduce the risk of selecting a poor classifier for classifying
new tweets as hate speech or not.

In future work, instead of using only BOW features, applying ensembles of feature sets may be a
promising direction for better performance. Feature sets such as n-grams, lexicons, POS tags, textual
features, or Twitter-specific features can be applied for improvement. Other types of features, such as
Word2Vec or Paragraph2Vec, can also be applied in the future.